1 Report Structure

In the next section, we outline the problem studied under this grant. Section 3 describes our
accomplishments, section 4 lists the papers published, and section 5 lists the students who have
been funded by this grant.

2  Problem  Studied 

The  literature  on  image  retrieval  is  growing,  with  several  efforts  in  both  academia  [9,  7,  13,  21,  18, 
26,  15,  22]  and  industry  [6,  25,  8].  The  main  thrust  of  our  work  is  the  definition  of  basic  image 
representations  that  are  most  appropriate  for  image  search.  With  the  aim  of  a  unified  treatment,  we 
have  developed  the  notion  of  a  signature  to  summarize  image  appearance.  Signatures  can  represent 
the  color,  shape,  or  texture  content  of  an  image.  They  are  more  flexible  than  feature  vectors  and 
histograms,  as  they  imply  no  fixed  number  or  ordering  of  feature  primitives,  as  in  vectors,  nor 
fixed-pitch  quantization  of  feature  values,  as  in  histograms.  Color,  shape,  and  texture  signatures 
are  described  in  sections  3.1,  3.4,  and  3.5. 

By  using  a  single  representation  format  for  the  three  different  modalities  considered  in  our  work, 
that  is,  color,  shape,  and  texture,  we  have  made  our  retrieval  mechanisms  essentially  uniform  across 
modalities.  This  has  led  not  only  to  efficiency  and  simplicity,  but  also  to  conceptual  consistency. 

The  other  main  ingredient  of  a  retrieval  system,  besides  signatures,  is  a  perceptually  meaningful 
measure  of  similarity  between  two  images.  We  have  defined  such  a  measure  based  on  what  we 
call  the  “Earth  Mover’s  Distance”  (section  3.2).  With  these  two  ingredients,  the  pictures  in  a 
database  can  be  organized  so  as  to  keep  similar  images  close  to  each  other.  In  this  context,  we  have 
developed  efficient  data  structures  for  sublinear  nearest-neighbor  retrieval.  In  addition,  a  similarity 
metric  between  images  leads  to  methods  for  laying  out  either  all  the  images  in  the  database,  or 
a  sample  thereof,  or  a  small  number  of  mutually  related  images,  and  for  displaying  these  images 
in  an  intuitive  way  for  the  user.  The  mathematical  tool  we  used  for  the  creation  of  this  layout  is 
multi-dimensional  scaling  (MDS). 

In shape-based retrieval, we have used shape information in the presence of occlusions to
retrieve drawings from various collections, and we have developed shape indices by recording “what
basic shape appears where in the image.” We successfully experimented with databases of
illustrations from geometry textbooks, and of scanned-in Chinese characters. For this work, we extended
geometric hashing techniques to make our indices invariant under a transformation group.

We  have  built  a  web-based  retrieval  system  that  allows  fast  retrieval  from  a  20,000  image 
database  based  on  color  signatures  (section  3.6).  Furthermore,  we  demonstrated  the  notion  of 
a  database  navigator,  in  which  many  of  the  images  in  a  database  are  laid  out  in  three-dimensional 
space  (section  3.6).  The  user  then  navigates  in  this  space  with  a  joystick.  The  main  advantage  of 
this  new  interaction  paradigm  is  that  the  content  of  the  database  is  conveyed  to  the  user  all  at  once, 
rather than piecemeal, as in the more standard query/response protocol. A global view lets the
user form a mental picture of the database, just as one forms a mental picture of the contents of,
say,  a  bookstore  by  browsing  in  it  for  some  time.  If  the  images  are  arranged  in  a  coherent  fashion, 
consistent  with  our  similarity  metric,  the  ordering  rationale  is  easily  learned  by  the  user  without 
being  explicitly  identified.  At  a  more  local  level,  again  thanks  to  our  metric,  the  small  number  of 
images  returned  in  response  to  a  query  in  the  more  traditional  query/response  operating  mode  can 
be  displayed  so  as  to  emphasize  similarities  and  differences  among  the  images. 

A  retrieval  system  we  developed  for  police  mugshots  (section  3.7)  demonstrates  the  usefulness 
of  signatures,  EMD,  and  our  navigation  tools  for  a  real-world  application. 

In  the  following  section,  we  outline  the  main  achievements  of  our  work. 


3  Summary  of  Results 

3.1  Color  Signatures 

The  color  information  of  each  image  is  reduced  to  a  compact  representation  that  we  call  the  signature 
of  the  image.  In  general  a  signature  contains  a  varying  number  of  points  in  a  Euclidean  space  where 
a  weight  is  attached  to  each  point.  In  the  case  of  color  images,  the  points  represent  clusters  of  similar 
colors  in  CIE-LAB  space,  and  the  weight  of  a  point  is  the  fraction  of  the  image  area  with  that  color. 
The signatures thus obtained are compact: the color distribution of an entire image is summarized
by  a  handful  of  points,  typically  eight  to  twelve.  Since  signatures  represent  distributions  in  the 
CIE-LAB  color  space,  they  are  perceptually  significant,  in  that  Euclidean  distances  between  points 
are  strongly  correlated  with  perceptual  differences.  Because  of  clustering,  small  variations  in  the 
colors  of  an  image  have  little  effect  on  signatures,  thereby  providing  a  moderate  degree  of  invariance 
to  changes  of  viewpoint  and  lighting.  Finally,  signatures  are  simple  and  flexible  abstractions.  In 
fact,  the  cloud  of  weighted  points  that  makes  up  a  color  signature  lives  in  the  low-dimensional  space 
of  colors.  Furthermore,  just  as  objects  and  concepts  are  described  in  English  by  sentences  with  a 
variable  number  of  words,  so  images  are  summarized  by  a  variable  number  of  colors  in  a  signature. 
The ordering of colors is not meaningful, and is therefore not used. The relative importance of the
various  colors  is  explicitly  represented  by  the  weight  of  each  signature  component,  and  is  therefore 
immune  from  the  quantization  problems  inherent  in  color  histograms. 
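
The construction above can be sketched in a few lines. The report only says that signature points are clusters of similar colors, so the clustering method here (plain k-means seeded with the first k distinct colors) is an assumption, as are the function names. Pixels are assumed to be already converted to CIE-LAB triples; the conversion itself is omitted.

```python
import math

def _assign(pixels, centroids):
    """Group pixels by nearest centroid (Euclidean distance); drop empty groups."""
    groups = [[] for _ in centroids]
    for p in pixels:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        groups[nearest].append(p)
    return [g for g in groups if g]

def _mean(group):
    """Component-wise mean of a non-empty list of color triples."""
    return tuple(sum(channel) / len(group) for channel in zip(*group))

def color_signature(lab_pixels, k=8, iterations=20):
    """Return a signature: a list of (cluster center, fraction of image area)."""
    pixels = [tuple(p) for p in lab_pixels]
    # Hypothetical seeding choice: the first k distinct colors in scan order.
    centroids = list(dict.fromkeys(pixels))[:k]
    for _ in range(iterations):
        updated = [_mean(g) for g in _assign(pixels, centroids)]
        if updated == centroids:
            break
        centroids = updated
    groups = _assign(pixels, centroids)
    # The weight of a point is the fraction of pixels that fall in its cluster.
    return [(_mean(g), len(g) / len(pixels)) for g in groups]
```

The result has a variable number of entries (at most k), carries no meaningful ordering, and its weights sum to one, which matches the properties listed above.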

3.2  The  Earth  Mover’s  Distance 

We  define  the  distance  between  two  signatures  to  be  the  minimum  amount  of  ‘work’  needed  to 
transform  one  signature  into  the  other.  The  work  needed  to  move  a  point,  or  a  fraction  of  a  point, 
to  a  new  location  is  the  portion  of  the  weight  being  moved,  multiplied  by  the  Euclidean  distance 
between  the  old  and  the  new  locations.  When  changing  one  signature  to  another,  the  work  is  the 
sum  of  the  work  done  by  moving  the  weights  of  the  individual  points  of  the  source  signature  to 
those  of  the  destination  signature.  We  allow  the  weight  of  a  single  source  signature  point  to  be 
partitioned  among  several  destination  signature  points,  and  vice  versa.  The  distance  between  the 
source  and  destination  signatures  is  then  defined  to  be  the  minimum  amount  of  work  necessary 
to  thus  move  the  weight  of  the  source  to  that  of  the  destination  signature.  We  call  this  distance 
function the earth mover’s distance.
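
For concreteness, the definition above can be written as a transportation problem. The notation is ours and is only a sketch: the source signature has points p_i with weights w_{p_i}, the destination signature has points q_j with weights w_{q_j}, f_{ij} is the amount of weight moved from p_i to q_j, and both signatures are assumed to have the same total weight (true for color signatures, whose weights are fractions of the image area).

```latex
\begin{aligned}
\mathrm{EMD}(P, Q) \;=\; \min_{f_{ij}} \;& \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}\, d_{ij},
\qquad d_{ij} = \lVert p_i - q_j \rVert_2, \\
\text{subject to}\quad & f_{ij} \ge 0, \\
& \sum_{j=1}^{n} f_{ij} = w_{p_i} \quad (1 \le i \le m), \\
& \sum_{i=1}^{m} f_{ij} = w_{q_j} \quad (1 \le j \le n).
\end{aligned}
```

Allowing several f_{ij} to be nonzero for the same i (or the same j) is what lets the weight of one point be partitioned among several points of the other signature.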

Computing  the  earth  mover’s  distance  can  be  formulated  as  a  linear  programming  (LP)  problem 
[16].  Given  the  compact  nature  of  color  signatures,  this  LP  problem  is  relatively  small.  Still,  since 
computing  this  distance  is  the  main  operation  in  our  image  retrieval  systems,  we  are  devoting 
considerable  efforts  to  making  this  solution  as  fast  as  possible.  The  distance  between  two  images  is 
computed  in  a  few  milliseconds.  We  have  developed  bounds  that  can  be  used  both  to  exclude  from 
consideration images that are too distant from the query and to abort computation of a distance
once  it  is  certain  to  exceed  a  certain  value. 
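
The report does not state which bounds were developed. As an illustration of the idea only, one well-known lower bound holds when both signatures have the same total weight and the ground distance is Euclidean: the distance between the weighted centroids of the two signatures never exceeds their EMD. The sketch below uses it to discard candidates before any LP is solved; all names are hypothetical.

```python
import math

def centroid(signature):
    """Weighted mean of a signature given as [(point, weight), ...]."""
    total = sum(w for _, w in signature)
    dims = len(signature[0][0])
    return tuple(sum(p[d] * w for p, w in signature) / total for d in range(dims))

def emd_lower_bound(sig_a, sig_b):
    """Centroid distance: a lower bound on the EMD for equal total weights."""
    return math.dist(centroid(sig_a), centroid(sig_b))

def prune(query, database, radius):
    """Keep the names of entries whose lower bound does not exceed `radius`.

    `database` is a list of (name, signature) pairs. Entries that are dropped
    cannot be within `radius` of the query, so their exact EMD is never computed.
    """
    return [name for name, sig in database if emd_lower_bound(query, sig) <= radius]
```

Because the bound costs only a few arithmetic operations per image, it can be evaluated for every database entry, leaving the exact distance to the survivors.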


3.3  Color  Metric  Comparisons 

Multidimensional  distributions  are  often  used  in  computer  vision  to  describe  and  summarize  the 
color content of an image. Given two distributions of colors, it is often useful to define a
quantitative measure of their dissimilarity, with the intent of approximating perceptual dissimilarity as
well as possible. This is particularly important in image retrieval applications, but has
fundamental implications also for the understanding of color perception. Defining a distance between two
distributions  requires  first  a  notion  of  distance  between  the  basic  features  that  are  aggregated  into 
the  distributions.  We  call  this  distance  the  ground  distance.  For  instance,  in  the  case  of  color, 
the ground distance measures dissimilarity between individual colors. Fortunately, color ground
distance  has  been  carefully  studied  in  the  literature  of  psychophysics,  and  has  led  to  measures  like 
the  CIE-Lab  color  space  [27]. 

Given  a  ground  distance,  several  measures  have  been  proposed  in  the  literature  for  the  perceptual 
dissimilarity  of  color  distributions  and  distributions  in  general.  In  this  research,  we  surveyed  some  of 
these  measures,  and  compared  them  with  the  Earth  Mover’s  Distance  (EMD),  a  metric  we  proposed 
in  [17].  The  EMD  can  be  applied  both  to  histograms  and  to  so-called  signatures,  which  are  more 
flexible  representations  of  distributions.  We  showed  that  the  combination  of  EMD  with  signatures 
works  best  for  image  retrieval. 

3.4  Shape-Based  Illustration  Indexing  and  Retrieval 

We  have  developed  a  general  set  of  ideas  for  indexing  computer-generated  technical  illustrations 
based  on  the  shapes  present  in  them,  so  that  they  can  be  efficiently  retrieved  later  using  as  the 
key other ‘similar-looking’ illustrations (either pre-existing, or interactively drawn by the user). We
have restricted our attention to the domain of computer-generated technical illustrations for now,
where  shape  information  is  both  precisely  available  and  the  main  way  in  which  pictorial  meaning  is 
conveyed.  After  we  have  techniques  that  can  operate  successfully  in  this  domain,  we  plan  to  port 
them  to  other  kinds  of  pictorial  or  image  data  as  well  by  applying  shape  extraction  techniques  from 
computer  vision. 

We  proceed  as  follows:  given  an  illustration  P,  we  compute  a  compact  index  i(P)  which  records 
the  principal  shapes  present  in  P  and  their  location/orientation/size.  Then  given  a  collection  of 
illustrations,  we  compute  a  data-structure  for  recording  their  indices  so  that  later  queries  can  be 
answered  efficiently.  At  retrieval  time  we  are  given  another  illustration  Q;  we  compute  i(Q)  and 
then search the data base for illustrations P_i whose index i(P_i) is ‘similar’ to i(Q).

An  illustration  P  for  us  is  a  collection  of  instanced  graphics  primitives  (lines  or  polylines, 
circular  arcs,  Bezier  cubic  or  B-spline  arcs,  marks,  etc.),  as  is  almost  universally  the  case  with 
the  illustrators  in  common  use  today  (e.g.,  Adobe  Illustrator,  Aldus  Freehand,  Xfig,  etc.).  We  start 
with a collection of basic shapes which may be built-in, or user-definable. In the index i(P) of an
illustration P we record ‘which basic shapes appear where.’ In other words, for each basic shape,
we  record  in  the  index  the  translation,  rotation,  and  scale  transformations  which  cause  this  basic 
shape  to  match  well  some  of  the  shapes  present  in  P,  according  to  the  Hausdorff  distance  [3].  Thus 
we can think of the index as a list of ‘colored’ points in R^4, where the four coordinates are the four
parameters defining the transformation, and the color is the label of the basic shape involved. (We
actually store the logarithm of the scale parameter, so as to make variations in scale correspond to
point translations in R^4, just as for translations and rotations.)
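
A minimal sketch of one index entry, assuming the four parameters are a two-dimensional translation, a rotation angle, and a scale factor. The type and field names are hypothetical and are not taken from [5].

```python
import math
from typing import NamedTuple

class IndexPoint(NamedTuple):
    """One 'colored' point: a basic-shape label plus four transformation parameters."""
    shape: str        # label ("color") of the basic shape
    tx: float         # translation along x
    ty: float         # translation along y
    theta: float      # rotation, in radians
    log_scale: float  # logarithm of the scale factor

def index_point(shape, tx, ty, theta, scale):
    """Build an index entry, storing the scale as its logarithm."""
    return IndexPoint(shape, tx, ty, theta, math.log(scale))

# Doubling the scale of a match moves its point by log(2) along the last axis,
# whatever the original scale was.
small = index_point("circle", 1.0, 2.0, 0.5, 3.0)
large = index_point("circle", 1.0, 2.0, 0.5, 6.0)
shift = large.log_scale - small.log_scale
```

Storing the logarithm is what turns a multiplicative change of scale into an additive offset, so that it behaves like the translation and rotation coordinates.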

When a query illustration comes in, we compute its index i(Q) in the same way. At the moment
we match i(Q) with the index of every illustration in the data base, by computing in R^4 the colored
one-way Hausdorff distance under translation between the two point sets representing the indices;
a fuller explanation of the matching mechanism is given in [5]. We are optimistic that in the future
we will be able to attain sublinear query-time algorithms (algorithms which do not need to compare
i(Q) with every other illustration index) by using computational geometric techniques on the set
of indices, essentially by clustering illustrations whose indices have a small ‘distance’ from each
other.
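
The core of the comparison can be sketched as a colored one-way (directed) Hausdorff distance at a fixed alignment. This is a simplification: the matching described above also minimizes over translations of one point set, which is omitted here, and the wrap-around of the rotation coordinate is ignored. Indices are assumed to be lists of (label, point) pairs; the function name is hypothetical.

```python
import math

def colored_directed_hausdorff(query_index, stored_index):
    """Largest distance from a query point to its nearest same-label stored point.

    Returns infinity if some query label does not occur in the stored index.
    """
    worst = 0.0
    for label, point in query_index:
        candidates = [math.dist(point, other)
                      for other_label, other in stored_index
                      if other_label == label]
        nearest = min(candidates, default=math.inf)
        worst = max(worst, nearest)
    return worst
```

The distance is one-way: stored points that have no counterpart in the query do not raise the score, so a query showing only part of an illustration can still match it closely.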

A library of approximately two hundred illustrations from a geometry textbook was indexed
using  this  scheme  and  then  used  for  retrieval  experiments.  An  interactive  interface  was  provided  for 
specifying  the  data  base  to  be  searched  and  the  query  illustration,  for  setting  various  parameters 
regarding  the  match,  and  for  displaying  the  best  matches  found  in  the  data  base.  Details  and 
examples  are  provided  in  [5]. 

3.5  Texture  Metrics 

Similarity  measures  between  textures  are  important  for  image  understanding  applications  such  as 
content-based  image  retrieval,  texture  segmentation,  and  texture  classification.  In  order  to  be 
useful,  it  is  important  that  these  similarity  measures  correspond  to  human  texture  perception.  In 
addition, in image retrieval it is often crucial that the similarity distances be metric, so that efficient
data  structures  and  search  algorithms  [2,  4]  can  be  used. 
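
To illustrate why the metric property matters, the sketch below uses the triangle inequality to skip distance computations in a range query. It is a generic pivot-based filter, not the data structures of [2, 4]; the names and the single-pivot design are assumptions. For any metric, dist(query, item) is at least |dist(query, pivot) - dist(pivot, item)|, so items whose bound exceeds the search radius can be rejected without evaluating the expensive distance.

```python
def range_search(query, items, pivot, dist, pivot_dists, radius):
    """Return (matches, number of distance evaluations).

    `pivot_dists[k]` must hold dist(pivot, items[k]), precomputed offline.
    """
    d_query_pivot = dist(query, pivot)
    evaluations = 1
    matches = []
    for item, d_pivot_item in zip(items, pivot_dists):
        # Triangle inequality: dist(query, item) >= |d_query_pivot - d_pivot_item|.
        if abs(d_query_pivot - d_pivot_item) > radius:
            continue
        evaluations += 1
        if dist(query, item) <= radius:
            matches.append(item)
    return matches, evaluations
```

If the distance were not a metric, the bound could reject items that are in fact within the radius, which is why a non-metric similarity measure rules out this kind of pruning.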

In  this  research  we  defined  a  class  of  texture  metrics  based  on  texture  features  close  to  the  model 
of simple cells in the primary visual cortex [11]. For the distance between texture feature histograms
we used the Earth Mover’s Distance (EMD), an effective and efficient measure of histogram
differences [17]. We evaluated our metrics both quantitatively, by examining the actual distances between
different textures, and qualitatively, by using multidimensional scaling techniques [24] to find which
texture properties affect our metrics the most, and to “visualize” the metrics. We
obtained  similar  results  to  those  found  by  psychophysical  experiments  [23,  14],  thereby  confirming 
our  claim  that  our  texture  metrics  correspond  to  human  perception. 

3.6  Database  Navigation 

The  user  of  an  image  retrieval  system  would  typically  like  to  specify  queries  in  semantic  terms  (e.g. 
“children playing in a park”). Unfortunately, the state of the art in computer vision does not yet allow
for  such  queries.  Instead,  systems  use  simpler  syntactic  image  features  such  as  color,  texture  and 
shape  [6,  1,  7,  12,  13],  in  the  hope  that  these  correlate  well  with  semantic  features.  This  discrepancy 
between syntactic and semantic queries causes a basic problem with the traditional query/response
style  of  interaction.  An  overly  generic  query  yields  a  large  jumble  of  images,  which  are  hard  to 
examine,  while  an  excessively  specific  query  may  cause  many  good  images  to  be  overlooked  by  the 
system.  This  is  the  traditional  trade-off  between  good  precision  (few  false  positives)  and  good  recall 
(few  false  negatives).  Striving  for  both  good  precision  and  good  recall  may  pose  an  excessive  burden 
on  the  definition  of  a  “correct”  measure  of  image  similarity.  While  most  image  retrieval  systems, 
including  the  ones  above,  recognize  this  and  allow  for  an  iterative  refinement  of  queries,  the  number 
of  images  returned  for  each  query  is  usually  kept  low  so  that  the  user  can  examine  them  one  at  a 
time. 

In  contrast,  we  suggested  that  with  an  appropriate  display  technique,  which  is  the  main  point  of 
this research, many more images can be returned without overloading the user’s attention.
Specifically, if images can be arranged on the screen so as to reflect similarities and differences between
their  color  distributions,  the  initial  queries  can  be  very  generic,  and  return  a  large  number  of  images. 
The  consequent  low  initial  precision  is  an  advantage  rather  than  a  weakness.  In  fact,  the  user  can 
see  large  portions  of  the  database  at  a  glance,  and  form  a  global  mental  model  of  what  is  in  it. 
Rather  than  following  a  thin  path  of  images  from  query  to  query,  as  in  the  traditional  approach, 
the  user  now  zooms  in  to  the  images  of  interest.  Precision  is  added  incrementally  in  subsequent 
query  refinements,  and  fewer  and  fewer  images  are  displayed  as  the  desired  images  are  approached. 
In our system, we use the distributions of colors in images as our retrieval features. These have
been  shown  [22,  6,  20,  1,  7,  12,  13]  to  be  useful  retrieval  cues.  When  a  (usually  vague)  query  is 
specified  or  drawn  by  the  user,  we  locate  and  display  a  large  number  of  neighboring  images  in  the 
database.  Since  queries  in  our  system  are  image-like,  neighborhood  can  be  defined  in  terms  of  the 
distance  between  images.  The  resulting  images  are  then  used  for  more  focused  queries  that  return 
fewer  and  fewer  images.  At  every  step,  query  results  are  embedded  in  two-dimensional  space  by 
using  multi-dimensional  scaling  (MDS)  [19,  10],  by  which  we  place  picture  thumbnails  on  the  screen 
so  that  screen  distances  reflect  as  closely  as  possible  the  distances  between  the  images.  While  more 
traditional  displays  list  images  in  order  of  similarity  to  the  query,  thereby  representing  n  distances  if 
n images are returned, our display conveys information about all the (n choose 2) = n(n-1)/2 distances between images.
This  display  makes  it  easy  for  the  user  to  grasp  the  entire  set  of  returned  images  at  a  glance, 
understand  how  the  query  actually  performed,  and  decide  where  to  go  next.  In  fact,  such  geometric 
embeddings  allow  the  user  to  perceive  the  dominant  axes  of  variation  in  the  displayed  image  group. 
When  the  user  selects  a  region  of  interest  on  the  display,  a  new,  more  specific  query  is  automatically 
generated,  and  returns  a  smaller  set  of  images.  These  are  again  displayed  by  a  new  MDS,  which 
now reflects the new dominant axes of variation. Thus, the embeddings are adaptive, in the sense
that  they  use  the  screen’s  real  estate  to  emphasize  whatever  happen  to  be  the  main  differences 
and  similarities  among  the  particular  images  at  hand.  By  iterating  this  process,  the  user  is  able  to 
quickly  navigate  to  the  portion  of  the  image  space  of  interest,  typically  in  very  few  mouse  clicks. 
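
The embedding step can be sketched as follows. The report cites [19, 10] for MDS without stating the variant used; this example swaps in classical (Torgerson) MDS with a plain power iteration, chosen only because it fits in a few lines without external libraries. The input is assumed to be a symmetric matrix of pairwise image distances, and the function name and parameters are hypothetical.

```python
import math
import random

def classical_mds(dist, dims=2, iterations=200, seed=0):
    """Return n points in `dims` dimensions from a symmetric n x n distance matrix."""
    n = len(dist)
    # Double-center the squared distances to obtain a Gram (inner-product) matrix.
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    all_mean = sum(row_mean) / n
    gram = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + all_mean)
             for j in range(n)] for i in range(n)]

    def times_gram(v):
        return [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for axis in range(dims):
        # Power iteration for the dominant eigenvector of the (deflated) matrix.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iterations):
            w = times_gram(v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break  # nothing left to embed along this axis
            v = [x / norm for x in w]
        length = math.sqrt(sum(x * x for x in v))
        unit = [x / length for x in v]
        eigenvalue = sum(x * y for x, y in zip(unit, times_gram(unit)))
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i][axis] = scale * unit[i]
        # Deflate so that the next axis captures the next eigenvalue.
        for i in range(n):
            for j in range(n):
                gram[i][j] -= eigenvalue * unit[i] * unit[j]
    return coords
```

Each returned row gives the screen position of one thumbnail. Re-running the function on the smaller set of images returned by a refined query yields new axes, which is the adaptive behavior described above. Since distances such as the EMD need not be exactly embeddable in the plane, the layout is in general only an approximation of the input distances.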

3.7  Mugshot  Retrieval 

A police mugshot retrieval system was developed as a feasibility test for a Canadian police
department. In order to identify a suspect in a crime, witnesses are often asked to scan large collections of
police  mugshots.  Fatigue  and  insufficient  attention  span  can  lead  to  distraction  during  this  process. 
A  system  that  lets  witnesses  navigate  through  the  mugshot  collection  in  a  more  coherent  fashion 
can  help  reduce  the  likelihood  of  costly  mistakes. 

The  witness  gives  general  indications  about  the  appearance  of  the  suspect,  such  as  age  group, 
sex,  race,  and  hair  color.  The  system  then  displays  many  relevant  images  as  small  thumbnail  icons 
on  a  single  screen,  arranging  them  in  such  a  way  that  similar  faces,  in  terms  of  the  attributes  of 
importance  for  the  given  search,  appear  close  to  each  other  on  the  screen.  Because  similar  images 
are nearby, it becomes much easier for a witness to concentrate his or her attention on the part of
the display of interest. When the witness selects an “interesting” part of the display, the system
produces a new display, with images that are similar to those in the interesting part. By repeating
this procedure, the witness can home in on the image of the suspect in a few steps.

In  order  to  apply  our  perceptual  navigation  algorithms,  we  needed  a  set  of  images  and  a  similarity 
measure  between  them.  The  image  set  was  provided  by  a  Canadian  police  department.  We  used 
a  very  simple  feature-based  similarity  measure  between  mugshots,  based  on  simple,  pre-annotated 
features  provided  by  the  police.  We  defined  a  distance  measure  for  every  feature,  and  the  similarity 
measure  between  two  mugshots  was  defined  as  a  linear  combination  of  these  distances.  The  relative 
importance  of  the  features  was  controlled  by  modifying  the  weights  of  each  feature.  The  user  could 
also “turn off” features which should not participate in the computation of the similarity measure.
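
The combination described above can be sketched as a weighted sum of per-feature distances in which a weight of zero turns a feature off. The feature names and the individual distance functions below are hypothetical; the report does not list the annotated features or how each per-feature distance was defined.

```python
def mugshot_distance(a, b, feature_distances, weights):
    """Linear combination of per-feature distances between two annotated mugshots.

    `a` and `b` map feature names to values, `feature_distances` maps feature
    names to distance functions, and `weights` maps feature names to their
    relative importance. A missing or zero weight switches the feature off.
    """
    total = 0.0
    for name, dist in feature_distances.items():
        weight = weights.get(name, 0.0)
        if weight > 0.0:
            total += weight * dist(a[name], b[name])
    return total

# Hypothetical features: a numeric age and a categorical hair color.
feature_distances = {
    "age": lambda x, y: abs(x - y) / 10.0,
    "hair_color": lambda x, y: 0.0 if x == y else 1.0,
}
first = {"age": 30, "hair_color": "brown"}
second = {"age": 40, "hair_color": "black"}

both_on = mugshot_distance(first, second, feature_distances,
                           {"age": 2.0, "hair_color": 1.0})
hair_off = mugshot_distance(first, second, feature_distances,
                            {"age": 2.0, "hair_color": 0.0})
```

Changing a weight changes the resulting distances, and therefore the layout produced by the navigation display, without touching the annotations themselves.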

4  Publications 

The  following  publications  have  been  produced  during  this  grant. 

R.  Manduchi  and  C.  Tomasi.  Distinctiveness  maps  for  image  matching.  In  10th  International 
Conference on Image Analysis and Processing (ICIAP), Venice, Italy, September 1999, pages 26-31.

Y.  Rubner  and  C.  Tomasi.  Texture-based  image  retrieval  without  segmentation.  In  Seventh 
International  Conference  on  Computer  Vision  (ICCV),  Kerkyra,  Greece,  September  1999,  pages 